[LLaDA 2.0-Uni] Add LLaDA 2.0-Uni multimodal discrete diffusion pipeline by ChinChyi · Pull Request #13686 · huggingface/diffusers

ChinChyi · 2026-05-06T17:18:28Z

What does this PR do?

Adds support for LLaDA 2.0-Uni, a unified multimodal discrete diffusion language model that supports text understanding, image understanding, and image generation in a single framework.

Paper: LLaDA2.0-Uni: Unifying Multimodal Understanding and Generation with Diffusion Large Language Model

New Components

LLaDA2UniImageTransformer2DModel — Image diffusion transformer for decoding VQ tokens to images
UniLLaDaPipeline — Unified pipeline supporting three modes:
- Text-to-image generation
- Image understanding (VQA, captioning)
- Image editing
LLaDA2UniFlowMatchEulerScheduler — Flow matching scheduler with Euler ODE integration
Image tokenizer utilities — SigVQ-based image encoding/decoding
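
As a rough illustration of what "flow matching with Euler ODE integration" means in the scheduler description above: the sampler integrates dx/dt = v(x, t) from noise toward data using fixed-size Euler steps. The sketch below is a toy in plain Python and does not reflect the LLaDA2UniFlowMatchEulerScheduler API. `toy_velocity` and `euler_flow_sample` are hypothetical names, and the velocity field is a stand-in for the model's prediction.

```python
def toy_velocity(x, t, target):
    """Stand-in for the model: velocity of the straight path from x to target."""
    return [(tgt - xi) / (1.0 - t) for xi, tgt in zip(x, target)]


def euler_flow_sample(x, target, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt  # t stays below 1, so the division above is safe
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2, 0.8]
target = [1.0, 0.0, -1.0]
sample = euler_flow_sample(noise, target, num_steps=8)
```

Because the toy path is a straight line, Euler integration lands on the target regardless of the step count. A learned velocity field is only approximately straight, which is why step count usually trades speed against quality.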

Key Features

Multimodal capabilities: Single model handles both vision and language tasks
Discrete diffusion: Block-wise iterative refinement for token generation
FP8 quantization support: Efficient inference with quantized weights
Flexible decoding: Supports both quality mode (50 steps) and turbo mode (8 steps)

Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from diffusers import UniLLaDaPipeline, BlockRefinementScheduler
from diffusers.pipelines.unillada.image_tokenizer import ImageTokenizer

model_id = "inclusionAI/LLaDA2.0-Uni"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
scheduler = BlockRefinementScheduler()
image_tokenizer = ImageTokenizer(model_path=model_id)

pipe = UniLLaDaPipeline(
    transformer=model,
    tokenizer=tokenizer,
    scheduler=scheduler,
    image_tokenizer=image_tokenizer,
)

# Text-to-Image
result = pipe(prompt="A cat sitting on a windowsill at sunset")
result.images[0].save("output.png")

# Image Understanding
from PIL import Image
img = Image.open("photo.jpg")
result = pipe(image=img, question="Describe this image in detail.")
print(result.text)

# Image Editing
result = pipe(image=img, instruction="Change the background to a beach.")
result.images[0].save("edited.png")

Testing

Added unit tests in tests/pipelines/unillada/test_unillada.py
Tests cover all three modes (generation, understanding, editing)
Mock components for CI compatibility

Model Weights

Official weights available at: https://huggingface.co/inclusionAI/LLaDA2.0-Uni

Before submitting

Did you read the contributor guideline?
Did you read our philosophy doc?
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

@yiyixuxu @a-r-r-o-w @DN6

Add UniLLaDA pipeline supporting text-to-image, image understanding, and image editing via block-wise iterative discrete diffusion.

New components:
- UniLLaDaPipeline: main pipeline (DiffusionPipeline subclass)
- LLaDA2UniImageTransformer2DModel: image transformer model
- LLaDA2UniFlowMatchEulerScheduler: flow matching scheduler
- ImageTokenizer: VQ image encoder helper
- Documentation and tests

dg845 · 2026-05-15T10:10:34Z

+        return torch.cat(result, dim=-1)
+
+
+class LLaDA2UniImageTransformer2DModel(ModelMixin, ConfigMixin, PeftAdapterMixin, FromOriginalModelMixin):


Is LLaDA2UniImageTransformer2DModel intended to be used as part of the UniLLaDA pipeline? I see that the transformer loaded by the pipeline is a remotely implemented transformers model (LLaDA2MoeModelLM in modeling_llada2uni_moe.py), and this transformer doesn't appear to be used anywhere.

dg845 · 2026-05-15T10:12:33Z

+        >>> model = AutoModelForCausalLM.from_pretrained(
+        ...     model_id, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="auto"
+        ... )


I would prefer a diffusers-native (or transformers-native) implementation of the DiT model so that we don't need trust_remote_code=True.

yiyixuxu · May 15, 2026

It's really OK to have trust_remote_code here. Ideally the model would be transformers-native, but that is up to them; we are not going to host transformers models in diffusers.

dg845 · 2026-05-15T10:14:54Z

+        return_dict: bool,
+    ) -> UniLLaDaPipelineOutput | tuple:
+        """Text-to-image generation."""
+        result = self.transformer.generate_image(


I think the denoising loop should be implemented in UniLLaDaPipeline.__call__ using a scheduler (such as BlockRefinementScheduler), which is the standard diffusers design, rather than in transformer methods like generate_image.
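
To make the suggested split concrete, here is a minimal sketch of a block-wise refinement loop in which the pipeline owns the loop and a scheduler decides which masked positions to commit at each step. It is a toy under stated assumptions: `ToyBlockScheduler`, `toy_model`, `generate`, and the `MASK` id are all hypothetical, the "model" returns random predictions, and none of this reflects the actual BlockRefinementScheduler API.

```python
import random

MASK = -1  # hypothetical mask token id


class ToyBlockScheduler:
    """Hypothetical scheduler: commits the most confident predictions each step."""

    def __init__(self, num_steps):
        self.num_steps = num_steps

    def step(self, tokens, predictions, confidences, block, step_index):
        masked = [i for i in block if tokens[i] == MASK]
        if not masked:
            return tokens
        steps_left = self.num_steps - step_index
        # Spread the remaining masked positions over the remaining steps.
        num_to_commit = -(-len(masked) // steps_left)  # ceiling division
        best = sorted(masked, key=lambda i: confidences[i], reverse=True)[:num_to_commit]
        tokens = list(tokens)
        for i in best:
            tokens[i] = predictions[i]
        return tokens


def toy_model(tokens, vocab_size, rng):
    """Stand-in for the transformer: a random token and confidence per position."""
    predictions = [rng.randrange(vocab_size) for _ in tokens]
    confidences = [rng.random() for _ in tokens]
    return predictions, confidences


def generate(seq_len=8, block_size=4, num_steps=2, vocab_size=10, seed=0):
    """Pipeline-style loop: iterate over blocks, refine each block step by step."""
    rng = random.Random(seed)
    scheduler = ToyBlockScheduler(num_steps)
    tokens = [MASK] * seq_len
    for start in range(0, seq_len, block_size):
        block = range(start, min(start + block_size, seq_len))
        for step_index in range(num_steps):
            predictions, confidences = toy_model(tokens, vocab_size, rng)
            tokens = scheduler.step(tokens, predictions, confidences, block, step_index)
    return tokens


tokens = generate()
```

With this shape, the transformer only needs a forward pass, and the step count and unmasking policy live in the scheduler, where they can be swapped without touching the model.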

dg845 · 2026-05-15T10:18:14Z

+# ============================================================
+
+
+class ImageTokenizer:


Suggested change

-class ImageTokenizer:
+class ImageTokenizer(ModelMixin, ConfigMixin):

I think ImageTokenizer should inherit from ModelMixin and ConfigMixin (which is standard for diffusers models) so that saving and loading can be handled in the normal diffusers way, rather than needing to implement it separately in __init__ below.

dg845 · 2026-05-15T10:21:15Z

+OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]
+
+
+class ImagePreprocessor:


I think we should refactor the image preprocessing logic in ImagePreprocessor into a dedicated VaeImageProcessor subclass that lives in its own file (e.g. image_processor.py). See for example JoyImageEditImageProcessor as a reference:

diffusers/src/diffusers/pipelines/joyimage/image_processor.py

Line 66 in 68a4847

class JoyImageEditImageProcessor(VaeImageProcessor):
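
The `OPENAI_CLIP_STD` constant in the hunk above suggests that ImagePreprocessor applies CLIP-style per-channel normalization, which any replacement processor would need to preserve. Below is a minimal plain-Python sketch of that step, assuming pixel values are already scaled to [0, 1]. The mean values are the standard OpenAI CLIP statistics and are an assumption here, since only the std appears in the diff; `normalize_pixel` is a hypothetical helper.

```python
OPENAI_CLIP_MEAN = [0.48145466, 0.4578275, 0.40821073]  # assumed; not shown in the diff
OPENAI_CLIP_STD = [0.26862954, 0.26130258, 0.27577711]  # taken from the diff


def normalize_pixel(rgb):
    """Apply (value - mean) / std per channel to one RGB pixel in [0, 1]."""
    return [(c - m) / s for c, m, s in zip(rgb, OPENAI_CLIP_MEAN, OPENAI_CLIP_STD)]


# A pixel equal to the channel means maps to zero in every channel.
centered = normalize_pixel(OPENAI_CLIP_MEAN)
```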

dg845 · 2026-05-15T10:25:21Z

+            attn_impl = getattr(self.config, "_attn_implementation", "eager")
+            if attn_impl != "eager" and attn_impl in ALL_ATTENTION_FUNCTIONS:
+                attention_interface = ALL_ATTENTION_FUNCTIONS[attn_impl]
+                if "flash" in attn_impl:
+                    max_seqlen = (cu_seqlens[1:] - cu_seqlens[:-1]).max()
+                    attn_output, _ = attention_interface(


We should consider using dispatch_attention_fn instead as it handles the attention backends used here, such as Flash Attention (including flash_varlen) and torch native SDPA. For reference, see the attention backend docs.
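
For readers unfamiliar with the `cu_seqlens` convention in the flash-attention branch above: variable-length attention kernels typically pack several sequences into one tensor and describe the boundaries with cumulative sequence lengths. Below is a minimal sketch of that bookkeeping in plain Python. `build_cu_seqlens` and `max_seqlen_from_cu` are hypothetical helpers, not part of diffusers or flash-attn; the second mirrors `(cu_seqlens[1:] - cu_seqlens[:-1]).max()` from the diff.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative sequence lengths with a leading 0, one boundary per sequence."""
    return [0, *accumulate(seq_lens)]


def max_seqlen_from_cu(cu_seqlens):
    """Recover the longest sequence length from adjacent boundary differences."""
    return max(end - start for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]))


cu_seqlens = build_cu_seqlens([3, 5, 2])  # three packed sequences
max_seqlen = max_seqlen_from_cu(cu_seqlens)
```

Sequence `i` occupies rows `cu_seqlens[i]` up to (but not including) `cu_seqlens[i + 1]` of the packed tensor.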

dg845 · 2026-05-15T10:26:11Z

+        return self.net(x)
+
+
+class SigVQ(nn.Module):


Is the SigVQ model intended to be used as part of the UniLLaDA pipeline? I don't see it being used anywhere.

dg845 · 2026-05-15T10:28:43Z

+import PIL.Image
+
+
+def generate_crop_size_list(


Similar to #13686 (comment), I think we should refactor the image preprocessing logic here into a dedicated VaeImageProcessor subclass (possibly combined with the one from image_tokenizer.py).

dg845

Thanks for the PR! I left an initial design review :).

github-actions bot added the size/L (PR with diff > 200 LOC), documentation (Improvements or additions to documentation), models, tests, utils, pipelines, and schedulers labels and removed the size/L (PR with diff > 200 LOC) label on May 6, 2026

dg845 requested review from dg845 and yiyixuxu May 14, 2026 08:47

Merge branch 'main' into add-unillada-pipeline

b4a4674

github-actions bot added the size/L (PR with diff > 200 LOC) label on May 15, 2026

dg845 reviewed May 15, 2026

ChinChyi changed the title ~~[UniLLaDA] Add UniLLaDA multimodal discrete diffusion pipeline~~ [LLaDA 2.0-Uni] Add LLaDA 2.0-Uni multimodal discrete diffusion pipeline May 18, 2026

ChinChyi wants to merge 2 commits into huggingface:main from ChinChyi:add-unillada-pipeline
